Finite-sample frequency distributions originating from an equiprobability distribution 



Thorsten Pöschel

Humboldt-Universität zu Berlin, Charité, Institut für Biochemie, Monbijoustraße 2, D-10117 Berlin, Germany

Jan A. Freund

Humboldt-Universität zu Berlin, Institut für Physik, Invalidenstraße 110, D-10115 Berlin, Germany

(Dated: February 1, 2008) 

Given an equidistribution for probabilities $p(i) = 1/N$, $i = 1 \ldots N$,
what is the expected corresponding rank ordered frequency
distribution $f(i)$, $i = 1 \ldots N$, if an ensemble of M events is
drawn?



I. INTRODUCTION 

The probability p(i) to draw an event i from a set of
N possible events is defined as the limit

$$p(i) = \lim_{M \to \infty} \frac{M_i}{M} \quad \text{with} \quad \sum_{j=1}^{N} M_j = M \tag{1}$$

where $f(i) = M_i/M$ is the relative frequency of the
i-th event, which occurs $M_i$ times in a randomly chosen
sample of size M. According to the law of large numbers the
relative frequencies stochastically converge to the corresponding
probabilities, hence, for very large sample size $M \gg N$
we can practically identify both values. For smaller sample
size, however, the distribution of relative frequencies
may deviate significantly from the probability distribution
(e.g. [1, 2] and many others). Assume further the
equidistribution $p(i) = 1/N$, then for very large sample
size $M \gg N$ one expects that all or almost all of the N
possible events are found in the sample and occur with
approximately equal frequency, whereas in the opposite
case $M \ll N$ almost all events found in the sample occur
only once or a few times, i.e. the latter frequency
distribution will deviate significantly from the former one. From
this simple argument one may conclude that the
expected frequency distribution depends sensitively
on the sample size M. Finite-size effects of this type
are of major relevance for the statistical analysis of DNA
and other biosequences, e.g. [5, 6, 7]. In this article
we calculate the expected frequency distribution
as a function of the sample size M.

If we draw a frequency distribution of N different
events there are N! possibilities to arrange the events
along the abscissa. An arrangement which leads to a
decaying function for the frequencies or probabilities we call
a Zipf order and the corresponding distribution a Zipf
ordered distribution. To find the Zipf ordered frequency
distribution which we have to expect if M events are drawn
from an equidistribution, we have to determine as a
function of M how many events (on average) are not
drawn, i.e. are drawn zero times, how many are drawn
once, twice, etc.
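
As an illustration of the Zipf order, the following minimal Python sketch sorts the observed relative frequencies of a sample in decreasing order. The function name `zipf_ordered_frequencies` and the toy sample are assumptions made for this example only; they do not appear in the derivation.

```python
from collections import Counter

def zipf_ordered_frequencies(sample, n_events):
    """Relative frequencies f(i) = M_i / M sorted in decreasing order.

    Events that never occur contribute zeros, so the result has length n_events.
    """
    m = len(sample)
    counts = sorted(Counter(sample).values(), reverse=True)
    counts += [0] * (n_events - len(counts))   # pad with the undrawn events
    return [c / m for c in counts]

# Hypothetical toy sample: M = 6 draws from N = 5 possible events.
print(zipf_ordered_frequencies([3, 1, 3, 5, 3, 1], 5))
```

The event labels are discarded by the sorting step, which is why only the numbers of events drawn zero times, once, twice, etc. matter for the Zipf ordered distribution.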

In the next section we derive the expectation value
$\langle K_i \rangle$ of the number of those events which occur i times in
the sample. The analytic expression is then used to infer
the unknown total number N of events and to compare
the theoretically expected Zipf ordered frequency
distribution with the measured one. The results are useful in
connection with entropy estimates computed from finite
samples.



II. THE NUMBER $K_i$ OF DIFFERENT EVENTS
EACH OCCURRING EXACTLY i TIMES AND
ITS EXPECTATION VALUE $\langle K_i \rangle$

The result of M subsequent drawings from a set of
N different equiprobable events can be identified with
randomly placing M indistinguishable balls in N
indistinguishable urns, each having the same probability $1/N$.
Denoting the number of urns containing exactly i balls
by $k_i$, a possible outcome can be described concisely by
the vector $(k_0, k_1, \ldots, k_M)$; this is what we call a
cluster configuration. The number of empty urns is given
by $k_0$ and, consequently, the number of occupied urns by
$N - k_0$. Any admissible cluster configuration obeys the
following two conditions

$$\sum_{i=0}^{M} k_i = N \quad \text{(total number of urns)} \tag{2}$$

$$\sum_{i=0}^{M} i\,k_i = M \quad \text{(total number of balls)} \tag{3}$$
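
The two conditions can be checked on a concrete sample. The sketch below is a minimal Python illustration; the helper name `cluster_configuration`, the seed, and the values N = 10, M = 7 are arbitrary assumptions for this example.

```python
import random
from collections import Counter

def cluster_configuration(sample, n_urns):
    """Return [k_0, ..., k_M]: k_i is the number of urns holding exactly i balls."""
    m = len(sample)
    occupancy = Counter(sample)            # urn label -> number of balls
    k = [0] * (m + 1)
    for balls in occupancy.values():
        k[balls] += 1
    k[0] = n_urns - len(occupancy)         # urns that were never hit
    return k

random.seed(0)                             # arbitrary seed, for reproducibility
N, M = 10, 7
sample = [random.randrange(N) for _ in range(M)]
k = cluster_configuration(sample, N)
assert sum(k) == N                                    # condition (2)
assert sum(i * k_i for i, k_i in enumerate(k)) == M   # condition (3)
print(k)
```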



We are interested in the stochastic variable $K_i$, denoting
the number of urns each filled with exactly i balls, and its
expectation value $\langle K_i \rangle$. Introducing for each i = 0, 1, ...
and each urn j = 1, ..., N its related indicator $I_i(j)$ by
the following definition



$$I_i(j) = \begin{cases} 1 & \text{if urn } j \text{ contains exactly } i \text{ balls} \\ 0 & \text{else} \end{cases}, \tag{4}$$

the random variable $K_i$ is related to the stochastic
indicator variables by $K_i = \sum_{j=1}^{N} I_i(j)$. Applying the
expectation operator we find



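$$\langle K_i \rangle = \left\langle \sum_{j=1}^{N} I_i(j) \right\rangle = \sum_{j=1}^{N} \langle I_i(j) \rangle = N \langle I_i(1) \rangle, \tag{5}$$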



where we have used that all $\langle I_i(j) \rangle$ are identical. The
probability to find exactly i balls in a given urn (here
labeled j), with the remaining balls distributed arbitrarily
among the remaining N - 1 urns, is given by the binomial
distribution, hence,



$$\langle K_i \rangle = N \binom{M}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{M-i}. \tag{6}$$
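
Equation (6) can be evaluated numerically, and since the binomial probabilities sum to one with mean M/N, the expectation values must satisfy the same sum rules as conditions (2) and (3). The following Python sketch checks this; the function name `expected_clusters` and the values N = 1000, M = 500 are assumptions for illustration, and the logarithmic evaluation is merely one convenient way to avoid overflow of the binomial coefficient.

```python
import math

def expected_clusters(i, m, n):
    """<K_i> from Eq. (6): n * C(m, i) * (1/n)**i * (1 - 1/n)**(m - i), for n >= 2."""
    if i < 0 or i > m:
        return 0.0
    log_binom = math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
    log_term = log_binom - i * math.log(n) + (m - i) * math.log1p(-1.0 / n)
    return n * math.exp(log_term)

N, M = 1000, 500
k = [expected_clusters(i, M, N) for i in range(M + 1)]
print(k[0], k[1], k[2])
print(sum(k))                                    # close to N, cf. condition (2)
print(sum(i * k_i for i, k_i in enumerate(k)))   # close to M, cf. condition (3)
```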



The expectation value $\langle K_i \rangle$ indicates how many events
in a sample of size M occur exactly i times on average.
We call events which occur exactly i times i-clusters. Obviously,
for small $M \ll N$ nearly all of the N possible events
are 0-clusters, i.e., they do not occur in our sample. As
M increases the number of single occupations increases
as well. For still growing M the number of multiple
occupations becomes larger and, therefore, the number of
1-clusters decreases as more and more events occur
multiple times in the sample. Figure 1 shows the occupation
of the N possible different events as a function of the
sample size M. The lines show the theoretical result Eq. (6)
and the symbols in Fig. 1 show the clusters
as they have been found in sets of random numbers.

If we draw a sample of size M only once from a set of
N possible events and calculate the occupation numbers
(the cluster frequencies), the cluster distribution itself
fluctuates (Fig. 2, filled dots). When the cluster distribution
is averaged over a number of independent samples, each of
size M, it converges to the theoretical curve predicted
by Eq. (6). Figure 2 shows the cluster
distribution as found from a random sample of size
M out of N = 1000 allowed events together with cluster
distributions averaged over 100 independent drawings.
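
The convergence of the averaged cluster distribution can be reproduced with a small Monte Carlo experiment. The sketch below is a hedged illustration rather than the simulation used for the figures: N = 1000 and the 100 runs follow the text, whereas M = 2000, the seed, and the function names are assumptions.

```python
import math
import random
from collections import Counter

def expected_clusters(i, m, n):
    """<K_i> according to Eq. (6), evaluated via logarithms (n >= 2)."""
    log_binom = math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
    return n * math.exp(log_binom - i * math.log(n) + (m - i) * math.log1p(-1.0 / n))

def simulated_clusters(i, m, n, runs, rng):
    """Average number of events occurring exactly i times over independent samples."""
    total = 0
    for _ in range(runs):
        occupancy = Counter(rng.randrange(n) for _ in range(m))
        if i == 0:
            total += n - len(occupancy)          # events never drawn
        else:
            total += sum(1 for c in occupancy.values() if c == i)
    return total / runs

rng = random.Random(1)           # arbitrary seed
N, M, RUNS = 1000, 2000, 100     # M is our choice; N and RUNS follow the text
for i in (0, 1, 2, 6):
    print(i, simulated_clusters(i, M, N, RUNS, rng), expected_clusters(i, M, N))
```

Setting `runs=1` shows the fluctuations of a single drawing, while larger values of `runs` bring the simulated averages closer to Eq. (6).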



III. ZIPF ORDERED FREQUENCY 
DISTRIBUTION 

Equation (6) allows us to determine the expectation value
of the number of different events N* as a function of
the total number of drawn events M, which is simply
related to the expected number of clusters of
size zero:



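$$N^* = N - \langle K_0 \rangle \tag{7}$$

from which we compute

$$\frac{N^*}{N} = 1 - \left(1 - \frac{1}{N}\right)^{M} = 1 - \exp\left[M \log\left(1 - \frac{1}{N}\right)\right] = 1 - \exp\left(-\frac{M}{N}\right) \cdot \exp\left[-\frac{M}{2N^2} + \mathcal{O}\left(\frac{M}{N^3}\right)\right]. \tag{8}$$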
FIG. 1: Expectation values for clusters of different sizes over
the sample size M taken from a set of N = 1000
equidistributed different events. The lines show the theoretical
result calculated from Eq. (6). The symbols show the cluster
distribution found from a numerical simulation where M
random integers have been drawn from the interval [1, 1000]. The
lower figure shows the same data for a larger range of sample
sizes.



From Eq. (8) we see that for all $M \ll N^2$ Eq. (8) can be
approximated to very good accuracy by

$$\frac{N^*}{N} \approx 1 - \exp\left(-\frac{M}{N}\right), \tag{9}$$



a result which has been found before by computer
simulations [8]. The maximal absolute deviation between the
exact value of N* from Eq. (8) and the approximation (9) is
$1/e \approx 0.37$ (for N = M = 1), which falls rapidly to
$1/(2e) \approx 0.18$ as N goes to infinity. In Fig. 3 the
analytical result is compared with a simulation. The
impulses in the figure show the results of a single
realization. If we average the numerical results over several
runs the numerical curve coincides with the analytical
one.
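
The size of the deviation between the exact expression and the exponential approximation can be checked directly. The Python sketch below evaluates both at M = N, where the difference $N e^{-M/N}\left[1 - e^{-M/(2N^2)}\right] \approx \frac{M}{2N} e^{-M/N}$ is largest for large N; the function names are hypothetical.

```python
import math

def distinct_exact(m, n):
    """Expected number of different events, N* = N - <K_0> = N * (1 - (1 - 1/N)**M)."""
    return n * (1.0 - (1.0 - 1.0 / n) ** m)

def distinct_approx(m, n):
    """Approximation N* ~ N * (1 - exp(-M/N)) of Eq. (9)."""
    return n * (1.0 - math.exp(-m / n))

for n in (1, 10, 100, 1000, 100000):
    deviation = distinct_exact(n, n) - distinct_approx(n, n)   # evaluated at M = N
    print(n, deviation)
print(1 / math.e, 1 / (2 * math.e))   # limiting values quoted in the text
```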



FIG. 2: If samples of M events are drawn independently, the
resulting averaged cluster distribution converges to the
theoretical value Eq. (6). The filled dots show the distribution of
clusters of size 2 (top) and 6 (bottom) resulting from a
single drawing. The diamonds are averaged cluster distributions
over 100 independent drawings of random numbers. The solid
lines are the theoretical values according to Eq. (6).

For the wide range of practical interest, 5/8 <
N*/M < 1, from Eq. (9) we may approximate the
entropy of the distribution if we know the number of
different events N* contained in a sample of size M:

$$S \approx \mathrm{ld}\, \frac{\left[3M + \sqrt{3M\,(8N^* - 5M)}\right] M}{12\,(M - N^*)} \tag{10}$$
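
The argument of the logarithm in Eq. (10) is an estimate of the total number of events N; it is what one obtains from Eq. (9) if $N^*/M$ is expanded to second order in M/N and the resulting quadratic equation is solved for N (our reading of the derivation, which also explains the lower bound 5/8). The following Python sketch, with hypothetical function names and test values, checks the estimate for self-consistency.

```python
import math

def estimate_total_events(n_star, m):
    """Estimate N from the argument of ld in Eq. (10); requires 5/8 < n_star/m < 1."""
    if not 5 * m < 8 * n_star < 8 * m:
        raise ValueError("need 5/8 < N*/M < 1")
    root = math.sqrt(3 * m * (8 * n_star - 5 * m))
    return (3 * m + root) * m / (12 * (m - n_star))

def estimate_entropy_bits(n_star, m):
    """Entropy estimate S ~ ld N (logarithm to base 2)."""
    return math.log2(estimate_total_events(n_star, m))

# Self-consistency check with hypothetical values N = 1000, M = 500.
N, M = 1000, 500
n_star = N * (1 - math.exp(-M / N))          # expected N* according to Eq. (9)
print(estimate_total_events(n_star, M), estimate_entropy_bits(n_star, M))
```

In practice `n_star` would be the number of different events actually counted in the sample, so the estimate inherits the fluctuations of a single realization.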



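We now compile the results from the previous section
to find the desired Zipf ordered frequency distribution.
Equation (6) tells how many events, on average, do
not occur in the sample (are drawn zero times), how many
are drawn once, twice, etc. This yields directly the Zipf
ordered frequency distribution

$$f(i) = \begin{cases}
0 & \text{for } N \ge i > N - \langle K_0 \rangle \\
\frac{1}{M} & \text{for } N - \langle K_0 \rangle \ge i > N - \langle K_0 \rangle - \langle K_1 \rangle \\
\;\vdots & \\
\frac{j}{M} & \text{for } N - \sum_{k=0}^{j-1} \langle K_k \rangle \ge i > N - \sum_{k=0}^{j} \langle K_k \rangle
\end{cases} \tag{11}$$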



Using Stirling's formula to expand the expressions in Eq.
(6) the analytical result Eq. (11) can be written easily
in elementary functions.

FIG. 3: Number of different events N* in a sample of size
M drawn from an equidistribution $p_i = 1/N$. The dashed line
shows the analytical result, the impulses show the
results of a computer simulation. Top: N = 1000, bottom:
N = 100.

Figure 4 shows Zipf ordered frequency distributions
calculated from a sample of random numbers (dashed
lines) together with the theoretical distributions due to
Eq. (11) (solid lines). The combinatorial theory
derived in the previous section predicts with good accuracy
the Zipf ordered frequency distribution which results
from an equidistribution.
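
The construction of the expected Zipf ordered distribution from the expected cluster numbers can be sketched as follows. This is a minimal Python illustration under stated assumptions: frequencies are taken as relative frequencies j/M, rank boundaries are compared with integer ranks without any rounding of $\langle K_k \rangle$, and the function names and the values M = 2000, N = 1000 are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import accumulate

def expected_clusters(i, m, n):
    """<K_i> according to Eq. (6), evaluated via logarithms (n >= 2)."""
    log_binom = math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
    return n * math.exp(log_binom - i * math.log(n) + (m - i) * math.log1p(-1.0 / n))

def zipf_expected(m, n):
    """Expected Zipf ordered relative frequencies f(1) >= ... >= f(n), cf. Eq. (11)."""
    cumulative = list(accumulate(expected_clusters(k, m, n) for k in range(m + 1)))
    freqs = []
    for rank in range(1, n + 1):
        # j is the cluster size with  sum_{k<j} <K_k>  <=  n - rank  <  sum_{k<=j} <K_k>
        j = min(bisect_right(cumulative, n - rank), m)
        freqs.append(j / m)
    return freqs

f = zipf_expected(m=2000, n=1000)
print(f[0], f[len(f) // 2], f[-1])
```

The resulting step function can be compared with the sorted relative frequencies of a simulated sample of the same size.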



IV. DISCUSSION 

Based on combinatorial considerations we calculated
the expectation values $\langle K_i \rangle$ of the number of events which
occur exactly i times in a sample of size M drawn from an
equidistribution. These expected cluster numbers allow us to
estimate the total number of events N, provided the
number of different events found in a sample of size M is
known. Moreover, the full Zipf ordered frequency
distribution could be constructed. By numerical simulations it
has been demonstrated that the analytically derived
values agree with the numerical results.
FIG. 4: Zipf ordered frequency distributions calculated from
samples of size M from an equiprobability distribution of
N = 1000 different random numbers (dashed lines). The
theoretical curve due to Eq. (11) is drawn with solid lines.



Discussions with Werner Ebeling and Rosa Ramirez 
are acknowledged. 



[1] A. O. Schmitt, H. Herzel, and W. Ebeling, Europhys. Lett.
28, 303 (1993).

[2] T. Pöschel, W. Ebeling, and H. Rosé, J. Stat. Phys. 80,
1443 (1995).

[3] H. Herzel, A. O. Schmitt, and W. Ebeling, Chaos, Solitons
& Fractals 4, 97 (1994).

[4] P. Allegrini, M. Buiatti, P. Grigolini, and B. West, Phys.
Rev. E 58, 3640 (1998).

[5] P. Bernaola-Galván, I. Grosse, P. Carpena, J. Oliver,
R. Román-Roldán, and H. E. Stanley, Phys. Rev. Lett.
85, 1342 (2000).

[6] D. Holste, I. Grosse, and H. Herzel, Phys. Rev. E 64,
041917 (2001).

[7] C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin,
M. Simons, and H. E. Stanley, Phys. Rev. E 47, 3730
(1993).

[8] R. Ramirez (1999), private communication.



